PROSPER LOAN

by TONG LI

Introduction

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

In this project, I am going to explore the relationship between the loan amount (LoanOriginalAmount in the dataset) and some other variables selected from the original dataset.

Data Analysis

1. Variables selection through correlation matrix and univariate plots

1.1 General selection

Since the dataset contains many variables, before doing any data analysis I selected 10-15 variables I would like to explore. First, I excluded all variables with proportion of missing values larger than 10%.

## [1] 113937     81

1.2 Numeric variables selection

Second, in the variables with missing values fewer than 10%, I selected all the numeric variables and explore their correlation with LoanOriginalAmount. Because there are too many variables and an ordinary correlation matrix will be too big (47 by 47), I will use a flattened correlation table instead. This table contains four columns: (variable) i, (variable) j, cor, & p. Each row contains the correlation coefficient and the p value of one pair of the numeric variables.

By doing the above steps, in addition to LoanOriginalAmount, I selected 6 numeric variables whose correlation magnitude with LoanOriginalAmount were approximately between 0.3 and 0.9.

1.3 Factors selection

Next, I selected appropriate factors by plotting their distributions and excluding the ones with extreme ceiling or floor.

Four factors were selected.

Then the 7 numeric variales and 4 factors were selected from the dataset. A new variable “LoanOriginalQuarter” was created from the original variable “LoanOriginationQuarter”, with only Q1, Q2, Q3 and Q4 information in the new variable but not the specific years.

2. Univariate Analysis

2.1 Univariate Plots

## 'data.frame':    113937 obs. of  11 variables:
##  $ LoanOriginalAmount   : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment   : num  330 319 123 321 564 ...
##  $ Investors            : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ LP_CustomerPayments  : num  11396 0 4187 5143 2820 ...
##  $ LP_ServiceFees       : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ Term                 : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ CreditScoreRangeLower: int  640 680 480 800 680 740 680 700 820 820 ...
##  $ IncomeRange          : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IsBorrowerHomeowner  : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ LoanOriginalQuarter  : Factor w/ 4 levels "Q1","Q2","Q3",..: 3 1 1 4 3 4 2 2 4 4 ...
##  $ EmploymentStatus     : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  LoanOriginalAmount MonthlyLoanPayment   Investors      
##  Min.   : 1000      Min.   :   0.0     Min.   :   1.00  
##  1st Qu.: 4000      1st Qu.: 131.6     1st Qu.:   2.00  
##  Median : 6500      Median : 217.7     Median :  44.00  
##  Mean   : 8337      Mean   : 272.5     Mean   :  80.48  
##  3rd Qu.:12000      3rd Qu.: 371.6     3rd Qu.: 115.00  
##  Max.   :35000      Max.   :2251.5     Max.   :1189.00  
##                                                         
##  LP_CustomerPayments LP_ServiceFees         Term      
##  Min.   :   -2.35    Min.   :-664.87   Min.   :12.00  
##  1st Qu.: 1005.76    1st Qu.: -73.18   1st Qu.:36.00  
##  Median : 2583.83    Median : -34.44   Median :36.00  
##  Mean   : 4183.08    Mean   : -54.73   Mean   :40.83  
##  3rd Qu.: 5548.40    3rd Qu.: -13.92   3rd Qu.:36.00  
##  Max.   :40702.39    Max.   :  32.06   Max.   :60.00  
##                                                       
##  CreditScoreRangeLower         IncomeRange    IsBorrowerHomeowner
##  Min.   :  0.0         $25,000-49,999:32192   False:56459        
##  1st Qu.:660.0         $50,000-74,999:31050   True :57478        
##  Median :680.0         $100,000+     :17337                      
##  Mean   :685.6         $75,000-99,999:16916                      
##  3rd Qu.:720.0         Not displayed : 7741                      
##  Max.   :880.0         $1-24,999     : 7274                      
##  NA's   :591           (Other)       : 1427                      
##  LoanOriginalQuarter      EmploymentStatus
##  Q1:29678            Employed     :67322  
##  Q2:24906            Full-time    :26355  
##  Q3:27967            Self-employed: 6134  
##  Q4:31386            Not available: 5347  
##                      Other        : 3806  
##                      (Other)      : 2718  
##                      NA's         : 2255

Histogram of numeric variables:

Although Term is a numeric variable, it actually only has three levels: 12, 36, and 60.

All those numeric variables have skewed distributions (except for Term), and they are also on very different scales. Therefore they need to be standardized and transformed. Before doing the log transformation for each variable, an appropriate integer will be added to each datum based on the range of the standardized values to avoid NAN generated by zero values. The new distributions (after being standardized and log transformed) are as follows:

##  LoanOriginalAmount MonthlyLoanPayment   Investors      
##  Min.   :-1.1747    Min.   :-1.4140    Min.   :-0.7698  
##  1st Qu.:-0.6944    1st Qu.:-0.7310    1st Qu.:-0.7601  
##  Median :-0.2941    Median :-0.2841    Median :-0.3533  
##  Mean   : 0.0000    Mean   : 0.0000    Mean   : 0.0000  
##  3rd Qu.: 0.5865    3rd Qu.: 0.5143    3rd Qu.: 0.3344  
##  Max.   : 4.2689    Max.   :10.2701    Max.   :10.7375  
##                                                         
##  LP_CustomerPayments LP_ServiceFees     CreditScoreRangeLower
##  Min.   :-0.8736     Min.   :-10.0559   Min.   :-10.3158     
##  1st Qu.:-0.6632     1st Qu.: -0.3041   1st Qu.: -0.3847     
##  Median :-0.3338     Median :  0.3343   Median : -0.0838     
##  Mean   : 0.0000     Mean   :  0.0000   Mean   :  0.0000     
##  3rd Qu.: 0.2850     3rd Qu.:  0.6725   3rd Qu.:  0.5181     
##  Max.   : 7.6226     Max.   :  1.4303   Max.   :  2.9256     
##                                         NA's   :591

Distribution of factors:

The order of the IncomeRange levels needs to be rearranged. In addition, the labels on the x axis were too long and some were overlapped. After adjusted it was plotted again:

The types of Employment Status were confusing: ‘Full-time’, ‘Part-time’ and ‘Self-employed’ should all be considered as ‘Employed’. Therefore, another variable was created in which all the three types were labeled as ‘Employed’ as well.

However, since the loans with a status of ‘Employed’ were much more than the loans with other kinds of status, the Modified Employment Status was probably not a very good variable, if I want to explore its relationship with LoanOriginalAmount.

2.2 Analysis

What is the structure of your dataset?

There are 113,937 records in my dataset, with 11 variables. Seven variables are numeric variables, including LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments, LP_ServiceFees, Term, and CreditScoreRangeLower. Four variables are factors; their names and levels are as follows:

IncomeRange: Not employed, $0, $1-24,999, $25,000-49,999, $50,000-74,999, $75,000-99,999, $100,000+, Not displayed IsBorrowerHomeowner: Ture, False
LoanOriginalQuarter: Q1, Q2, Q3, Q4
EmploymentStatus2: Employed, Retired, Not employed, Other, Not available, NA (created based on EmploymentStatus)

Other observations:
1. The numeric variables are correlated with LoanOriginalAmount, with the magnitude of correlation approximately between 0.3 and 0.9.
2. The distribution of most numeric variables are not normal, and are on very different scales. Therefore, standardization and log transformation were used on LoanOriginalAmount, MonthlyLoanPayment, Investors, LP_CustomerPayments LP_ServiceFees, and CreditScoreRangeLower.
3. In the modified Employment Status (EmploymentStatus2), the ‘Employed’ status takes a proportion of more than 80%, therefore it might not be a very good predictor.

What is/are the main feature(s) of interest in your dataset? What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The main feature of interest is the origination amount of the loan (LoanOriginalAmount). I’d like to see which features are best for predicting the original amount of loan. I have already selected numeric variables which are moderately correlated to LoanOriginalAmount, and factors without extreme distributions, and I want to see whether some combination of these variables can be predict the original amount of loan.

Did you create any new variables from existing variables in the dataset?

Two new variables were created:
1. LoanOriginalQuarter: created from LoanOriginateQuarter. The new variable only contains in which quarter (Q1, Q2, Q3 or Q4) the loan was made, without the year information.
2. EmploymentStatus2: created from EmploymentStatus. The original variable includes ‘Employed’, ‘Full-time’, ‘Part-time’, ‘Self-employed’ and several other levels. However this is confusing because obviously ‘Employed’ should include the latter three levels. Therefore in the new variable, the level ‘Full-time’, ‘Part-time’ and ‘Self-employed’ were all relabeled as ‘Employed’.

3. Bivariate Analysis

3.1 Bivariate Plots

From now on, I will use the new data set created from the previous step which contains transformed numeric variables and adjusted factors.

Generally explore the relationship between the variables in the dataset:

After standardization and log transformation, some correlations were different from the correlation with original variables. Next, I will used plots to show the relationship between the transformed variables:

Although the correlation between LoanOriginalAmount and MonthlyLoanPayment is high (above 0.9), this plot shows that the relationship between the two variables is probably moderated by a third variable. It looks that three regression lines are needed here to capture the linear relationship between LoanOriginalAmount and MonthlyLoanPayment.

This plot shows a weak positive correlation between the two variables.

This plot shows a weak-to-medium correlation between the two variables.

Scatter plot to show the relationship between ServiceFees and LoanOriginalAmount:

This plot shows a medium negative correlation between the two variables.

This plot shows a positive medium correlation between the two variables. It looks that the correlation was affected by extreme values.

This plot shows that the amount of loan for 12, 36 and 60 months of term was different, with more amount corresponding to longer term.

The general trend was that with higher Income, the mean and variance of loan amount was also higher. The exception was at the income of 0. This might be because people receive 0 in their income for different reasons (temparory job, internship, etc.), so for the zero-income people their loan amount may be more likely to depend on other variables.

The loan amount for home owners and non home owners were not very different from each other, although larger variance of loan amount was found for home owners.

The loan amount in different quarter was not very different from one another.

From this plot, the loan amount of employed borrowers has larger variance and higher mean compared to that of borrowers with other employment status.

In addition to the main variable (LoanOriginalAmount) that I focused on, I also looked into the relationship between other variables, and found a relationship between pre charge-off cumulative gross payments made by the borrower on the loan (LP_CustomerPayments) and Cumulative service fees paid by the investors who have invested in the loan (LP_ServiceFees).

3.2 Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Observed relationships are as follows:
1. The plots revealed a strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment, which means that people with larger amount of original loan tend to schedule larger amount of monthly loan payment.
2. A weak positive correlation was found between LoanOriginalAmount and the number of investors that funded the loan, which shows that the amount of loan tends to be larger with more investors.
3. A weak positive correlation was found between LoanOriginalAmount and pre charge-off cumulative gross payments made by the borrower (LP_CustomerPayments), which means that the amount of loan tends to be larger if the borrower has made higher pre charge-off payment.
4. A medium negative correlation was found between LoanOriginalAmount and cumulative service fees paid by the investors who have invested in the loan, which means that the amount of loan tends to be larger if the investors have paid more service fees.
5. A weak positive correlation was found between LoanOriginalAmount and the lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency. It means that borrowers with higher credit score tend to have larger amount of loan.
6. Larger amount of loan tends to have longer term.
7. In general, more income is related to larger amount of loan.
8. Borrowers who are employed tend to have larger amount of loan.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

An interesting relationship is that there is a relatively strong negative correlation between pre charge-off cumulative gross payments made by the borrower on the loan and cumulative service fees paid by the investors who have invested in the loan. Therefore, if the borrower pay more for the pre charge-off fees, the investors will pay less for the service fees.

What was the strongest relationship you found?

The strongest relationship was the correlation between LoanOriginalAmount and MonthlyLoanPayment.

4. Multivariate Analysis

4.1 Multivariate Plots

The scatter plot of MonthlyLoanPayment and LoanOriginalAmount showed that three lines were possibly needed to capture the linear relationship between the two variables. Since Term was the only variable with three levels, I first investigated whether different levels of Term would affect the relationship between the two variables.

From the plot, the correlation between LoanOriginalAmount and MonthlyLoanPayment exhibited very good linear relationship on each level of Term. Therefore there should be a interaction between Term and MonthlyLoanPayment when predicting LoanOriginalAmount.

To see whether the relationship between Investors and LoanOriginalAmount is moderated by other variables, I tried several moderators and the plots were as follows:

From the above plots, the relationship between Investors and LoanOriginalAmount might be moderated by Term and IncomeRange. Other variables did not show a strong moderating effect.

To see whether the relationship between LP_CustomerPayments and LoanOriginalAmount is moderated by other variables:

Although in some of the above plots, the regression lines on different levels of a thid variable showed different slopes, more careful examination of the patterns of the dots suggested that most of the different slopes were probably driven by some outliers, or the extreme unbalance between the number of datum in each condition, but not the difference in the patterns of the relationship. Therefore, the only variables was found to moderate the relationship between LP_CustomerPayments and LoanOriginalAmount was LoanOriginalQuarter.

To see whether the relationship between LP_ServiceFees and LoanOriginalAmount is moderated by other variables:

This situation was similar to the previous one. No variable was found to moderate the relationship between LP_ServiceFees and LoanOriginalAmount.

To see whether the relationship between CreditScoreRangeLower and LoanOriginalAmount was moderated by other variables:

The situation looked like the ones with LP_ServiceFees, no other variable was found to moderate the relationship between CreditScoreRangeLower and LoanOriginalAmount.

It seems that there is an interaction between Term and whether the borrower is a home owner.

It seems that the length of term did not affect the relationship between income range and loan amount.

Again, the length of term did not affect the relationship between loan amount and in which quarter the loan was created.

It seems that whether the borrower is a home owner did not affect the relationship between the income range and the amount of loan.

It seems that in which quarter the loan was created did not affect the relationship between borrowers’ income range and the amount of loan.

4.2 Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there any interesting or surprising interactions between features?

The observed moderating effects/interactions are as follows:
1. The relationship between loan amount and monthly payment was moderated by the term of the loan. With longer loan term, larger amount of loan tend to relate even more monthly payment.
2. The relationship between loan amount and number of investors was moderated by term length and borrowers’ income. In general, more investors is related to larger amount of loan. With longer term, more investors is still related to larger amount of loan, but not as large as that with shorter term. On the other hand, with higher income, more investors is also related to larger amount of loan but the amount is not that large as for borrowers with less income.
3. An interaction was found between cumulative service fees paid by the investors who had invested in the loan and in which quarter the loan was created.
4. An interaction was found between loan term and whether the borrower was a home owner.

Model with my dataset:

Based on the univariate, bivariate and mutivariate analysis, I’m going to create a multiple regression model to predict original loan amount, including the variables that are correlated to LoanOriginalAmount, and the interactions found from the plots. I will add the variables to the model one after another, and then add interactions, meanwhile comparing the models using ANOVA. One noteworthy thing is that although there is an apparant interaction between IsBorrowerHomeowner and Term, I will not include this in the model, for the variable IsBorrowerHomeowner itself seems not to be a good predictor based on other plots, therefore the interaction should not be included. The situation with Quarter is similar: although an interaction between LP_ServiceFees and LoanOriginalQuarter was found, based on the bivariate plot, Quarter does not affect loan amount, therefore quarter and its interaction will not be included in the model.

## Analysis of Variance Table
## 
## Model  1: LoanOriginalAmount ~ MonthlyLoanPayment
## Model  2: LoanOriginalAmount ~ MonthlyLoanPayment + Investors
## Model  3: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments
## Model  4: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees
## Model  5: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term
## Model  6: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower
## Model  7: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange
## Model  8: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     EmploymentStatus2
## Model  9: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     EmploymentStatus2 + MonthlyLoanPayment:Term
## Model 10: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     EmploymentStatus2 + MonthlyLoanPayment:Term + Investors:IncomeRange
## Model 11: LoanOriginalAmount ~ MonthlyLoanPayment + Investors + LP_CustomerPayments + 
##     LP_ServiceFees + Term + CreditScoreRangeLower + IncomeRange + 
##     EmploymentStatus2 + MonthlyLoanPayment:Term + Investors:IncomeRange + 
##     Investors:Term
##    Res.Df    RSS Df Sum of Sq          F    Pr(>F)    
## 1  111678 466.18                                      
## 2  111677 466.13  1     0.048 2.9019e+01 7.181e-08 ***
## 3  111676 460.89  1     5.239 3.1482e+03 < 2.2e-16 ***
## 4  111675 423.39  1    37.501 2.2536e+04 < 2.2e-16 ***
## 5  111674 235.34  1   188.048 1.1301e+05 < 2.2e-16 ***
## 6  111673 230.46  1     4.879 2.9319e+03 < 2.2e-16 ***
## 7  111666 222.78  7     7.689 6.6006e+02 < 2.2e-16 ***
## 8  111662 222.44  4     0.337 5.0602e+01 < 2.2e-16 ***
## 9  111661 187.78  1    34.657 2.0827e+04 < 2.2e-16 ***
## 10 111654 187.69  7     0.087 7.4742e+00 5.075e-09 ***
## 11 111653 185.80  1     1.900 1.1415e+03 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Calls:
## m1: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment, data = newdata_comp)
## m2: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors, 
##     data = newdata_comp)
## m3: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments, data = newdata_comp)
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
## 
## ===========================================================================
##                            m1           m2           m3           m4       
## ---------------------------------------------------------------------------
##   (Intercept)            0.016***     0.016***     0.012***     0.616***   
##                         (0.000)      (0.000)      (0.000)      (0.006)     
##   MonthlyLoanPayment     0.938***     0.938***     0.946***     0.910***   
##                         (0.001)      (0.001)      (0.001)      (0.001)     
##   Investors                           0.002***     0.013***     0.006***   
##                                      (0.001)      (0.001)      (0.001)     
##   LP_CustomerPayments                             -0.021***    -0.056***   
##                                                   (0.001)      (0.001)     
##   LP_ServiceFees                                               -0.579***   
##                                                                (0.006)     
## ---------------------------------------------------------------------------
##   R-squared                  0.896        0.896        0.897        0.906  
##   adj. R-squared             0.896        0.896        0.897        0.906  
##   sigma                      0.065        0.065        0.064        0.062  
##   F                     965265.838   482684.380   325867.434   268518.131  
##   p                          0.000        0.000        0.000        0.000  
##   Log-likelihood        147470.456   147476.241   148107.369   152846.341  
##   Deviance                 466.179      466.131      460.892      423.391  
##   AIC                  -294934.913  -294944.482  -296204.738  -305680.683  
##   BIC                  -294906.043  -294905.988  -296156.621  -305622.943  
##   N                     111680       111680       111680       111680      
## ===========================================================================
## 
## Calls:
## m4: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees, data = newdata_comp)
## m5: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term, data = newdata_comp)
## m6: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower, 
##     data = newdata_comp)
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange, data = newdata_comp)
## 
## ================================================================================================
##                                                 m4           m5           m6           m7       
## ------------------------------------------------------------------------------------------------
##   (Intercept)                                 0.616***     0.082***    -0.051***    -0.081***   
##                                              (0.006)      (0.005)      (0.006)      (0.006)     
##   MonthlyLoanPayment                          0.910***     0.890***     0.882***     0.870***   
##                                              (0.001)      (0.001)      (0.001)      (0.001)     
##   Investors                                   0.006***     0.020***     0.018***     0.018***   
##                                              (0.001)      (0.000)      (0.000)      (0.000)     
##   LP_CustomerPayments                        -0.056***    -0.022***    -0.023***    -0.024***   
##                                              (0.001)      (0.001)      (0.001)      (0.001)     
##   LP_ServiceFees                             -0.579***    -0.216***    -0.217***    -0.228***   
##                                              (0.006)      (0.005)      (0.004)      (0.004)     
##   Term                                                     0.004***     0.004***     0.004***   
##                                                           (0.000)      (0.000)      (0.000)     
##   CreditScoreRangeLower                                                 0.133***     0.153***   
##                                                                        (0.003)      (0.003)     
##   IncomeRange: $0/Not employed                                                       0.028***   
##                                                                                     (0.002)     
##   IncomeRange: $1-24,999/Not employed                                                0.012***   
##                                                                                     (0.002)     
##   IncomeRange: $25,000-49,999/Not employed                                           0.015***   
##                                                                                     (0.002)     
##   IncomeRange: $50,000-74,999/Not employed                                           0.024***   
##                                                                                     (0.002)     
##   IncomeRange: $75,000-99,999/Not employed                                           0.028***   
##                                                                                     (0.002)     
##   IncomeRange: $100,000+/Not employed                                                0.035***   
##                                                                                     (0.002)     
##   IncomeRange: Not displayed/Not employed                                            0.043***   
##                                                                                     (0.002)     
## ------------------------------------------------------------------------------------------------
##   R-squared                                       0.906        0.948        0.949        0.950  
##   adj. R-squared                                  0.906        0.948        0.949        0.950  
##   sigma                                           0.062        0.046        0.045        0.045  
##   F                                          268518.131   404302.618   344442.272   164746.083  
##   p                                               0.000        0.000        0.000        0.000  
##   Log-likelihood                             152846.341   185638.532   186808.313   188702.986  
##   Deviance                                      423.391      235.343      230.464      222.775  
##   AIC                                       -305680.683  -371263.065  -373600.626  -377375.972  
##   BIC                                       -305622.943  -371195.701  -373523.639  -377231.621  
##   N                                          111680       111680       111680       111680      
## ================================================================================================
## 
## Calls:
## m7: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange, data = newdata_comp)
## m8: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + EmploymentStatus2, data = newdata_comp)
## m9: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + EmploymentStatus2 + MonthlyLoanPayment:Term, 
##     data = newdata_comp)
## 
## ====================================================================================
##                                                  m7           m8           m9       
## ------------------------------------------------------------------------------------
##   (Intercept)                                 -0.081***    -0.092***    -0.055***   
##                                               (0.006)      (0.010)      (0.009)     
##   MonthlyLoanPayment                           0.870***     0.870***     0.506***   
##                                               (0.001)      (0.001)      (0.003)     
##   Investors                                    0.018***     0.018***     0.019***   
##                                               (0.000)      (0.000)      (0.000)     
##   LP_CustomerPayments                         -0.024***    -0.025***    -0.022***   
##                                               (0.001)      (0.001)      (0.000)     
##   LP_ServiceFees                              -0.228***    -0.229***    -0.161***   
##                                               (0.004)      (0.004)      (0.004)     
##   Term                                         0.004***     0.004***     0.001***   
##                                               (0.000)      (0.000)      (0.000)     
##   CreditScoreRangeLower                        0.153***     0.155***     0.161***   
##                                               (0.003)      (0.003)      (0.003)     
##   IncomeRange: $0/Not employed                 0.028***     0.039***     0.035***   
##                                               (0.002)      (0.009)      (0.008)     
##   IncomeRange: $1-24,999/Not employed          0.012***     0.023**      0.021**    
##                                               (0.002)      (0.008)      (0.008)     
##   IncomeRange: $25,000-49,999/Not employed     0.015***     0.026**      0.024**    
##                                               (0.002)      (0.008)      (0.008)     
##   IncomeRange: $50,000-74,999/Not employed     0.024***     0.034***     0.031***   
##                                               (0.002)      (0.008)      (0.008)     
##   IncomeRange: $75,000-99,999/Not employed     0.028***     0.039***     0.034***   
##                                               (0.002)      (0.008)      (0.008)     
##   IncomeRange: $100,000+/Not employed          0.035***     0.045***     0.040***   
##                                               (0.002)      (0.008)      (0.008)     
##   IncomeRange: Not displayed/Not employed      0.043***     0.047***     0.042***   
##                                               (0.002)      (0.009)      (0.008)     
##   EmploymentStatus2: Retired/Employed                       0.008***     0.005***   
##                                                            (0.002)      (0.001)     
##   EmploymentStatus2: Not employed/Employed                  0.011        0.007      
##                                                            (0.008)      (0.008)     
##   EmploymentStatus2: Other/Employed                        -0.009***    -0.007***   
##                                                            (0.001)      (0.001)     
##   EmploymentStatus2: Not available/Employed                 0.007        0.005      
##                                                            (0.004)      (0.004)     
##   MonthlyLoanPayment x Term                                              0.010***   
##                                                                         (0.000)     
## ------------------------------------------------------------------------------------
##   R-squared                                        0.950        0.951        0.958  
##   adj. R-squared                                   0.950        0.951        0.958  
##   sigma                                            0.045        0.045        0.041  
##   F                                           164746.083   126178.486   142305.802  
##   p                                                0.000        0.000        0.000  
##   Log-likelihood                              188702.986   188787.474   198245.116  
##   Deviance                                       222.775      222.439      187.782  
##   AIC                                        -377375.972  -377536.949  -396450.231  
##   BIC                                        -377231.621  -377354.104  -396257.764  
##   N                                           111680       111680       111680      
## ====================================================================================
## 
## Calls:
## m9: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + EmploymentStatus2 + MonthlyLoanPayment:Term, 
##     data = newdata_comp)
## m10: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + EmploymentStatus2 + MonthlyLoanPayment:Term + 
##     Investors:IncomeRange, data = newdata_comp)
## m11: lm(formula = LoanOriginalAmount ~ MonthlyLoanPayment + Investors + 
##     LP_CustomerPayments + LP_ServiceFees + Term + CreditScoreRangeLower + 
##     IncomeRange + EmploymentStatus2 + MonthlyLoanPayment:Term + 
##     Investors:IncomeRange + Investors:Term, data = newdata_comp)
## 
## ===============================================================================================
##                                                             m9           m10          m11      
## -----------------------------------------------------------------------------------------------
##   (Intercept)                                            -0.055***    -0.049***    -0.016      
##                                                          (0.009)      (0.009)      (0.009)     
##   MonthlyLoanPayment                                      0.506***     0.505***     0.486***   
##                                                          (0.003)      (0.003)      (0.003)     
##   Investors                                               0.019***     0.028***     0.072***   
##                                                          (0.000)      (0.005)      (0.005)     
##   LP_CustomerPayments                                    -0.022***    -0.022***    -0.024***   
##                                                          (0.000)      (0.000)      (0.000)     
##   LP_ServiceFees                                         -0.161***    -0.164***    -0.180***   
##                                                          (0.004)      (0.004)      (0.004)     
##   Term                                                    0.001***     0.001***     0.001***   
##                                                          (0.000)      (0.000)      (0.000)     
##   CreditScoreRangeLower                                   0.161***     0.160***     0.159***   
##                                                          (0.003)      (0.003)      (0.003)     
##   IncomeRange: $0/Not employed                            0.035***     0.034***     0.032***   
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: $1-24,999/Not employed                     0.021**      0.020**      0.019*     
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: $25,000-49,999/Not employed                0.024**      0.024**      0.022**    
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: $50,000-74,999/Not employed                0.031***     0.030***     0.029***   
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: $75,000-99,999/Not employed                0.034***     0.033***     0.032***   
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: $100,000+/Not employed                     0.040***     0.039***     0.037***   
##                                                          (0.008)      (0.008)      (0.008)     
##   IncomeRange: Not displayed/Not employed                 0.042***     0.041***     0.039***   
##                                                          (0.008)      (0.008)      (0.008)     
##   EmploymentStatus2: Retired/Employed                     0.005***     0.005**      0.004**    
##                                                          (0.001)      (0.001)      (0.001)     
##   EmploymentStatus2: Not employed/Employed                0.007        0.008        0.008      
##                                                          (0.008)      (0.008)      (0.008)     
##   EmploymentStatus2: Other/Employed                      -0.007***    -0.007***    -0.006***   
##                                                          (0.001)      (0.001)      (0.001)     
##   EmploymentStatus2: Not available/Employed               0.005        0.006        0.006      
##                                                          (0.004)      (0.004)      (0.004)     
##   MonthlyLoanPayment x Term                               0.010***     0.010***     0.010***   
##                                                          (0.000)      (0.000)      (0.000)     
##   Investors x IncomeRange: $0/Not employed                             0.004       -0.000      
##                                                                       (0.007)      (0.007)     
##   Investors x IncomeRange: $1-24,999/Not employed                     -0.004       -0.006      
##                                                                       (0.005)      (0.005)     
##   Investors x IncomeRange: $25,000-49,999/Not employed                -0.008       -0.008      
##                                                                       (0.005)      (0.005)     
##   Investors x IncomeRange: $50,000-74,999/Not employed                -0.011*      -0.010      
##                                                                       (0.005)      (0.005)     
##   Investors x IncomeRange: $75,000-99,999/Not employed                -0.009       -0.008      
##                                                                       (0.005)      (0.005)     
##   Investors x IncomeRange: $100,000+/Not employed                     -0.009       -0.008      
##                                                                       (0.005)      (0.005)     
##   Investors x IncomeRange: Not displayed/Not employed                 -0.001       -0.005      
##                                                                       (0.006)      (0.006)     
##   Investors x Term                                                                 -0.001***   
##                                                                                    (0.000)     
## -----------------------------------------------------------------------------------------------
##   R-squared                                                   0.958        0.958        0.959  
##   adj. R-squared                                              0.958        0.958        0.959  
##   sigma                                                       0.041        0.041        0.041  
##   F                                                      142305.802   102503.349    99611.605  
##   p                                                           0.000        0.000        0.000  
##   Log-likelihood                                         198245.116   198271.011   198839.016  
##   Deviance                                                  187.782      187.695      185.795  
##   AIC                                                   -396450.231  -396488.022  -397622.033  
##   BIC                                                   -396257.764  -396228.191  -397352.578  
##   N                                                      111680       111680       111680      
## ===============================================================================================

The ANOVA test shows that all variables and interactions in the last model (m11) significantly improve the model, therefore all of them should be kept in the model. From the above table of regression models, we can see that MonthlyLoanPayment alone can explain 89.6% of the variance of LoanOriginalAmount. This is consistent with our plots, indicating that MonthlyLoanPayment is the strongest predictor. The second strong predictor is Term, which means that longer loan term is related to larger amount of loan. The final model explains 95.9% of the variance of loan amount.

Final Plots and Summary

Plot One

Description One

LoanOriginalAmount is the predicted variable in this dataset. In this Plot One, this variable has been standardized and log transformed, because the original variable was right skewed, and was not on the same scale with all the other numeric variables in the dataset.

Plot Two

Description Two

This Plot two describes the strongest correlation related to the predicted variable (r > 0.9). Also, according to the plot, the relationship may be better captured by multiple regression lines, indicating that a third variable may moderate the relationship between monthly loan payment and loan amount.

Plot Three

Description Three

This Plot Three confirms my speculation from Plot Two. The relationship between Monthly Payment and Loan Amount was moderated by Loan Term, which had three levels: 12 months, 36 months and 60 months. According to Plot Three, with longer loan term, the correlation between monthly payment and loan amount tends to be stronger.

Reflection

The series of analysis have shown that the original amount of loan is related to many other variables. This dataset contains six numeric variables in addition to the amount of loan, as well as four factors. After examining the relationship between loan amount and these other variables with plots and multiple linear regression, I found that except for whether the borrower was a home owner and in which quarter the loan was created, all other variables predicted the amount of loan.

A strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment was found, which means that people with larger amount of original loan are more likely to schedule larger amount of monthly loan payment. A further analysis showed that this relationship was moderated by the length of loan term. With longer loan term, larger amount of loan tends to relate to even more monthly payment.

The amount of loan also tends to be larger with more investors, and this relationship was found to be moderated by both loan term and borrow’s income. With longer loan term, the correlation between number of investors and loan amount was not as strong as with shorter term; Similarly, with higher borrower’s income, the correlation was not as strong as with less income.

In general, higher income the borrower had, the larger amount of loan they tend to borrow. The employed borrowers and borrowers with higher credit score also tend to have larger amount of loan compared to borrowers with other employment status. Loan with longer term is also more likely has larger amount.

The amount of loan also tends to be larger if the borrower has made higher pre charge-off payment, or if the investors have paid more service fees.

Although the model has shown that using these variables and the interactions between several of them could predict a high proportion of the variance of loan amount, it still has limitations. Monthly payment explains more than 89% of the variance of loan amount and seems to be a very good predictor; However, there might be some variables that could predict both loan amount and monthly payment. For example, one could think that the borrower’s average expense per month, their marital status, how many children they have, and their estates, stock and other properties should also predict both their monthly payment and loan amount. These variables are not include in the original dataset, for the original dataset does not focus on more personal information of the borrowers. I think with more information considered, the model can be much improved.